Abstract

Granger causality is a statistical notion of causal influence based on prediction via vector
autoregression. Developed originally in the field of econometrics, it has since found application in
a broader arena, particularly in neuroscience. More recently transfer entropy, an information-theoretic
measure of time-directed information transfer between jointly dependent processes, has
gained traction in a similarly wide field. While it has been recognized that the two concepts must
be related, the exact relationship has until now not been formally described. Here we show that for
Gaussian variables, Granger causality and transfer entropy are entirely equivalent, thus bridging
autoregressive and information-theoretic approaches to data-driven causal inference.

PACS numbers: 87.10.Mn, 87.19.L, 87.19.lj, 87.19.lo, 89.70.Cf 
Keywords: Granger causality, transfer entropy, causal inference 






The problem of inferring causal interactions from data has challenged scientists and
philosophers for centuries [1]. One approach that has become increasingly popular over
recent years was introduced originally by Wiener [2], and formalized in terms of linear
autoregression by Granger [3]. According to Wiener-Granger causality (G-causality), given sets
of inter-dependent variables X and Y, it is said that "Y G-causes X" if, in an appropriate
statistical sense, Y assists in predicting the future of X beyond the degree to which X
already predicts its own future. Importantly, identification of a G-causality interaction is not
identical to identifying a physically instantiated causal interaction in a system. Although
the two descriptions are intimately related [4, 5], physically instantiated causal structure can
only be unambiguously identified by perturbing a system and observing the consequences
[1]. Nonetheless, G-causality is pragmatic, well-defined, and has delivered many insights
into the functional connectivity of systems in a variety of fields, particularly in neuroscience
[6].

The information-theoretic notion of transfer entropy was formulated by Schreiber [7] 
as a measure of directed (time-asymmetric) information transfer between joint processes. 
In contrast to G-causality, transfer entropy is framed not in terms of prediction but in 
terms of resolution of uncertainty. One can say that "the transfer entropy from Y to X"
is the degree to which Y disambiguates the future of X beyond the degree to which X
already disambiguates its own future. There is therefore an attractive symmetry between
the notions ("predicts" ↔ "disambiguates") which has been noted previously (see e.g. [8])
but never explicitly specified. In this Letter we show that under Gaussian assumptions 
they are in fact entirely equivalent. Our results therefore provide a framework for inferring 
causality which unifies information-theoretic and autoregressive approaches. 

We use a standard mathematical vector/matrix notation in which bold type generally
denotes vector quantities and upper-case type denotes matrices or random variables, according
to context. All vectors are considered to be row vectors. The symbol '$\top$' denotes the
transpose operator and '$\oplus$' denotes concatenation of vectors, so that for $\mathbf{x} = (x_1, \ldots, x_n)$
and $\mathbf{y} = (y_1, \ldots, y_m)$, $\mathbf{x} \oplus \mathbf{y}$ is the $1 \times (n + m)$ vector $(x_1, \ldots, x_n, y_1, \ldots, y_m)$.

Given jointly distributed multivariate random variables (i.e. random vectors) X, Y, we
denote by $\Sigma(X)$ the $n \times n$ matrix of covariances $\mathrm{cov}(X_i, X_j)$ and by $\Sigma(X, Y)$ the $n \times m$
matrix of cross-covariances $\mathrm{cov}(X_i, Y_\alpha)$. We then use $\Sigma(X \mid Y)$ to denote the $n \times n$ matrix

$$\Sigma(X \mid Y) \equiv \Sigma(X) - \Sigma(X, Y)\,\Sigma(Y)^{-1}\,\Sigma(X, Y)^{\top} \tag{1}$$

defined when $\Sigma(Y)$ is invertible. $\Sigma(X \mid Y)$ appears as the covariance matrix of the residuals
of a linear regression of X on Y [cf. eq. (3) below]; thus, by analogy with partial correlation
[9] we term $\Sigma(X \mid Y)$ the partial covariance [28] of X given Y.
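
As a concrete illustration of definition (1), the following minimal sketch computes a partial covariance from a joint covariance matrix. It assumes NumPy is available; the function name `partial_cov`, its index-list arguments and the toy matrix are hypothetical choices made for this example and do not come from the text.

```python
import numpy as np

def partial_cov(S, ix, iy):
    """Partial covariance Sigma(X|Y) from a joint covariance matrix S.

    ix, iy are index lists selecting the X and Y components of S
    (hypothetical argument names, chosen for this sketch).
    """
    S = np.asarray(S, dtype=float)
    Sxx = S[np.ix_(ix, ix)]          # Sigma(X),   n x n
    Sxy = S[np.ix_(ix, iy)]          # Sigma(X,Y), n x m
    Syy = S[np.ix_(iy, iy)]          # Sigma(Y),   m x m, must be invertible
    # Sigma(X) - Sigma(X,Y) Sigma(Y)^{-1} Sigma(X,Y)^T, via a linear solve
    return Sxx - Sxy @ np.linalg.solve(Syy, Sxy.T)

# Toy joint covariance of (X1, X2, Y1); the numbers are illustrative only.
S = np.array([[2.0, 0.5, 1.0],
              [0.5, 1.0, 0.0],
              [1.0, 0.0, 4.0]])
P = partial_cov(S, ix=[0, 1], iy=[2])
print(P)
```

Here only the variance of X1 is reduced (from 2.0 to 1.75), because X2 is uncorrelated with Y1 in the toy matrix.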

Suppose we have a multivariate stochastic process $X_t$ in discrete time [29] (i.e. the random
variables $X_{ti}$ are jointly distributed). We use the notation $X_t^{(p)} \equiv X_t \oplus X_{t-1} \oplus \ldots \oplus X_{t-p+1}$
to denote X itself, along with $p - 1$ lags, so that $X_t^{(p)}$ is a $1 \times pn$ random vector for each t.
Given the lag p, we use the shorthand notation $X_t^{-} \equiv X_{t-1}^{(p)}$ for the lagged variable.
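
The lag notation can be made concrete with a short sketch that builds the concatenated vector of a process and its p - 1 lags from a sampled time series. It assumes NumPy and a data layout with one observation per row; the helper name `lag_embed` is hypothetical.

```python
import numpy as np

def lag_embed(x, p):
    """Stack p consecutive values of a process into one row per time step.

    x : array of shape (T, n), row t holding the observation X_t.
    Returns an array of shape (T - p + 1, p * n) whose k-th row is
    X_t (+) X_{t-1} (+) ... (+) X_{t-p+1} for t = k + p - 1.
    """
    x = np.asarray(x, dtype=float)
    if x.ndim == 1:
        x = x[:, None]               # treat a 1-D series as n = 1
    T = x.shape[0]
    # Block j holds X_{t-j}; the slice start p-1-j aligns it with time t.
    blocks = [x[p - 1 - j : T - j] for j in range(p)]
    return np.hstack(blocks)

x = np.arange(10.0).reshape(5, 2)    # T = 5 observations of an n = 2 process
E = lag_embed(x, p=3)
print(E.shape)
```

Each row has p * n entries, matching the 1 x pn dimension stated above; the first p - 1 time steps are dropped because their lags are not available.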

Let X, Y be jointly distributed random vectors and consider the linear regression

$$X = \alpha + Y \cdot A + \varepsilon \tag{2}$$

where the $m \times n$ matrix A comprises the regression coefficients, $\alpha = (\alpha_1, \ldots, \alpha_n)$ are the
constant terms and the random vector $\varepsilon = (\varepsilon_1, \ldots, \varepsilon_n)$ comprises the residuals. The mean
squared error (MSE) may then be written in terms of the covariance matrix of the residuals
as $E^2 = \mathrm{trace}(\Sigma(\varepsilon))$. $E^2$ is just the sum of the variances of the $\varepsilon_i$, sometimes known as the
total variance. Performing an ordinary least squares (OLS) fit to find the coefficients A that
minimize $E^2$ yields [assuming $\Sigma(Y)$ invertible] $A = \Sigma(Y)^{-1}\,\Sigma(X, Y)^{\top}$ and we find that for
the least squares fit the covariance matrix of the residuals is given by

$$\Sigma(\varepsilon) = \Sigma(X \mid Y) \tag{3}$$

with $\Sigma(X \mid Y)$ the partial covariance as defined by (1). We note that the same coefficients
A which minimize the total variance $E^2$ also minimize the generalized variance $|\Sigma(\varepsilon)|$
[10], where $|\cdot|$ denotes the determinant (this procedure is sometimes referred to as "Least
Generalized Variance"; see e.g. [11]).
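
The identity (3) can be checked numerically. The sketch below fits the regression (2) by OLS on synthetic data and compares the sample residual covariance with the sample partial covariance. It assumes NumPy; the data-generating choices (sample size, dimensions, noise level) are arbitrary illustrations, not taken from the text.

```python
import numpy as np

rng = np.random.default_rng(0)
N, n, m = 5000, 2, 3

# Synthetic data: X depends linearly on Y plus noise (illustrative only).
Y = rng.standard_normal((N, m))
B = rng.standard_normal((m, n))
X = 1.0 + Y @ B + 0.5 * rng.standard_normal((N, n))

# OLS fit of X = alpha + Y.A + eps; the column of ones carries alpha.
D = np.hstack([np.ones((N, 1)), Y])
coef, *_ = np.linalg.lstsq(D, X, rcond=None)
resid = X - D @ coef

# Sample covariances, with rows as observations.
S = np.cov(np.hstack([X, Y]), rowvar=False)
Sxx, Sxy, Syy = S[:n, :n], S[:n, n:], S[n:, n:]
A = np.linalg.solve(Syy, Sxy.T)          # Sigma(Y)^{-1} Sigma(X,Y)^T
partial = Sxx - Sxy @ A                  # Sigma(X|Y) as in eq. (1)

resid_cov = np.cov(resid, rowvar=False)
print(np.allclose(resid_cov, partial), np.allclose(coef[1:], A))
```

With an intercept in the regression, the fitted slope coefficients coincide with the covariance-based expression for A, so the two covariance matrices agree up to floating-point error rather than only asymptotically.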

If the residuals $\varepsilon$ can be taken to be uncorrelated with the regressors Y in (2), as would
be the case, for instance, for a multivariate autoregressive (MVAR) model, the residual
covariance matrix can be derived directly from (2). Taking the covariance of both sides of
(2) yields

$$\Sigma(X) = A^{\top}\,\Sigma(Y)\,A + \Sigma(\varepsilon) \tag{4}$$

Since the residuals and regressors are uncorrelated, we also have

$$0 = \Sigma(Y, \varepsilon) = \Sigma(Y, X - \alpha - Y \cdot A) = \Sigma(X, Y)^{\top} - \Sigma(Y)\,A \tag{5}$$

Solving (5) for A and substituting in (4) we recover eq. (3) for $\Sigma(\varepsilon)$. We note that eqs. (4)
and (5) are essentially Yule-Walker equations [6] for the regression (2).
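
For completeness, the substitution step can be written out explicitly. The only fact used beyond (4) and (5) is that the covariance matrix of Y is symmetric, so its inverse equals the transpose of its inverse:

```latex
\begin{align*}
A &= \Sigma(Y)^{-1}\,\Sigma(X,Y)^{\top}
  && \text{from (5)} \\
\Sigma(\varepsilon) &= \Sigma(X) - A^{\top}\,\Sigma(Y)\,A
  && \text{from (4)} \\
 &= \Sigma(X) - \Sigma(X,Y)\,\Sigma(Y)^{-1}\,\Sigma(Y)\,\Sigma(Y)^{-1}\,\Sigma(X,Y)^{\top} \\
 &= \Sigma(X) - \Sigma(X,Y)\,\Sigma(Y)^{-1}\,\Sigma(X,Y)^{\top}
  \;=\; \Sigma(X \mid Y)
\end{align*}
```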

Suppose now we have three jointly distributed, stationary [30] multivariate stochastic
processes $X_t$, $Y_t$, $Z_t$ ("variables" for brevity). Consider the regression models:

$$X_t = \alpha + \left(X_{t-1}^{(p)} \oplus Z_{t-1}^{(r)}\right) \cdot A + \varepsilon_t \tag{6}$$

$$X_t = \alpha' + \left(X_{t-1}^{(p)} \oplus Y_{t-1}^{(q)} \oplus Z_{t-1}^{(r)}\right) \cdot A' + \varepsilon'_t \tag{7}$$

so that the "predictee" variable X is regressed firstly on the previous p lags of itself plus r
lags of the conditioning variable Z and secondly, in addition, on q lags of the "predictor"
variable Y [31]. The G-causality of Y to X given Z is a measure of the extent to which
inclusion of Y in the second model (7) reduces the prediction error of the first model (6).

The standard measure of G-causality in the literature is defined for univariate predictor
and predictee variables Y and X, and is given by the natural logarithm of the ratio of the
residual variance in the restricted regression (6) to that of the unrestricted regression (7).
In our notation [32]

$$\mathcal{F}_{Y \to X \mid Z} \equiv \ln\!\left(\frac{\mathrm{var}(\varepsilon_t)}{\mathrm{var}(\varepsilon'_t)}\right) = \ln\!\left(\frac{\Sigma(X \mid X^{-} \oplus Z^{-})}{\Sigma(X \mid X^{-} \oplus Y^{-} \oplus Z^{-})}\right) \tag{8}$$

where the last equality follows from the general formula (3). By stationarity this expression
does not depend on time t, so we drop the subscript when there is no danger of confusion.
Note that the residual variance of the first regression will always be larger than or equal to
that of the second, so that $\mathcal{F}_{Y \to X \mid Z} \ge 0$ always. As regards statistical inference, it is known
that the corresponding maximum likelihood estimator $\hat{\mathcal{F}}_{Y \to X \mid Z}$ will have (asymptotically
for large samples) a $\chi^2$-distribution under the null hypothesis $\mathcal{F}_{Y \to X \mid Z} = 0$ [12, 13] and a
non-central $\chi^2$-distribution under the alternative hypothesis $\mathcal{F}_{Y \to X \mid Z} > 0$ [14, 15].

Although rarely considered in the literature, there is no requirement in principle that
either the predictee or predictor variable be univariate. In this Letter we address the general
case where all variables are allowed to be multivariate; see [16] and [17] for motivation and
discussion regarding this generalization. For the case of a multivariate predictor, eq. (8)
above (with Y replaced by the bold-type Y) is a valid and consistent formula for G-causality.
However, generalization to the case of a multivariate predictee is less clear cut and there
does not yet appear to be a standard definition for G-causality in the literature. Here we use
an extension first proposed by Geweke [14], in which the residual variance $\mathrm{var}(\varepsilon_t) = \Sigma(\varepsilon_t)$
is replaced by the generalized variance $|\Sigma(\varepsilon_t)|$:

$$\mathcal{F}_{Y \to X \mid Z} \equiv \ln\!\left(\frac{|\Sigma(\varepsilon_t)|}{|\Sigma(\varepsilon'_t)|}\right) = \ln\!\left(\frac{|\Sigma(X \mid X^{-} \oplus Z^{-})|}{|\Sigma(X \mid X^{-} \oplus Y^{-} \oplus Z^{-})|}\right) \tag{9}$$

This formula always produces a non-negative quantity, and for a univariate predictee
reduces to (8). Moreover, its estimator is also asymptotically $\chi^2$-distributed. Geweke [14]
lists a number of motivations for this choice, to which we add the result presented in
this Letter. (An alternative formulation for multivariate G-causality is proposed in [16],
although see [17] for more detailed discussion and further motivation for the form (9).)
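
A minimal sketch of how the sample G-causality might be computed from time series is given below. It fits the restricted and unrestricted regressions by OLS and takes the log-ratio of the determinants of the residual covariance matrices; for a univariate predictee this is the log-ratio of residual variances. The sketch assumes NumPy, uses the same number of lags p for all three variables (a simplification of the separate lags p, q, r in the text), and the function names and the simulated system are hypothetical.

```python
import numpy as np

def lagged(x, p):
    """Rows X_{t-1} (+) ... (+) X_{t-p} for t = p, ..., T-1 (x has shape (T, n))."""
    T = x.shape[0]
    return np.hstack([x[p - j : T - j] for j in range(1, p + 1)])

def resid_cov(target, regressors):
    """Residual covariance of an OLS regression with intercept."""
    D = np.hstack([np.ones((len(target), 1)), regressors])
    coef, *_ = np.linalg.lstsq(D, target, rcond=None)
    return np.atleast_2d(np.cov(target - D @ coef, rowvar=False))

def granger(x, y, z, p):
    """Sample G-causality of y to x given z, with p lags of every variable."""
    xt = x[p:]
    restricted = np.hstack([lagged(x, p), lagged(z, p)])
    full = np.hstack([restricted, lagged(y, p)])
    # Log-ratio of generalized residual variances (determinants).
    _, ld_r = np.linalg.slogdet(resid_cov(xt, restricted))
    _, ld_f = np.linalg.slogdet(resid_cov(xt, full))
    return ld_r - ld_f

# Illustrative system (an assumption of this sketch): y and z drive x, nothing drives y.
rng = np.random.default_rng(1)
T = 4000
x = np.zeros((T, 1)); y = np.zeros((T, 1)); z = np.zeros((T, 1))
e = rng.standard_normal((T, 3))
for t in range(1, T):
    y[t] = 0.5 * y[t - 1] + e[t, 1]
    z[t] = 0.5 * z[t - 1] + e[t, 2]
    x[t] = 0.4 * x[t - 1] + 0.6 * y[t - 1] + 0.3 * z[t - 1] + e[t, 0]

F_yx = granger(x, y, z, p=1)   # y drives x, so this should be clearly positive
F_xy = granger(y, x, z, p=1)   # no x -> y coupling, so this should be near zero
print(F_yx, F_xy)
```

Because the two regressions are nested, the sample statistic cannot be meaningfully negative; a value close to zero in one direction and clearly positive in the other reflects the directed coupling built into the simulation.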

With $X_t$, $Y_t$, $Z_t$ as before, the transfer entropy of Y to X given Z [7, 18] is defined as
the difference between the entropy of X conditioned on its own past and the past of Z, and
its entropy conditioned, in addition, on the past of Y:

$$\mathcal{T}_{Y \to X \mid Z} \equiv H(X \mid X^{-} \oplus Z^{-}) - H(X \mid X^{-} \oplus Y^{-} \oplus Z^{-}) \tag{10}$$

where $H(\cdot)$ denotes entropy and $H(\cdot \mid \cdot)$ conditional entropy. Again, by stationarity transfer
entropy does not depend on time t, and $\mathcal{T}_{Y \to X \mid Z} \ge 0$ always. $\mathcal{T}_{Y \to X \mid Z}$ may be understood
as the degree of uncertainty of X resolved by the past of Y over and above the degree of
uncertainty of X resolved by its own past. As with Granger causality, the transfer entropy
literature generally deals only with univariate variables, although in this case the extension
(10) to the multivariate case is unproblematic.

We now turn to the equivalence with G-causality. For a multivariate Gaussian random
variable X we have the well-known expression [19]

$$H(X) = \tfrac{1}{2}\ln(|\Sigma(X)|) + \tfrac{1}{2}\,n\ln(2\pi e)$$

for entropy in terms of the determinant of the covariance matrix, where n is the dimension
of X. We now show that the conditional entropy $H(X \mid Y)$ for two jointly multivariate
Gaussian variables may be expressed in terms of the determinant of the corresponding
partial covariance matrix:

$$H(X \mid Y) = \tfrac{1}{2}\ln(|\Sigma(X \mid Y)|) + \tfrac{1}{2}\,n\ln(2\pi e) \tag{11}$$



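To see this, we have

$$H(X \mid Y) = H(X \oplus Y) - H(Y) = \tfrac{1}{2}\ln(|\Sigma(X \oplus Y)|) - \tfrac{1}{2}\ln(|\Sigma(Y)|) + \tfrac{1}{2}\,n\ln(2\pi e)$$

Now

$$\Sigma(X \oplus Y) = \begin{pmatrix} \Sigma(X) & \Sigma(X, Y) \\ \Sigma(X, Y)^{\top} & \Sigma(Y) \end{pmatrix}$$

and from the block determinant identity [20]

$$\begin{vmatrix} A & B \\ C & D \end{vmatrix} = |D|\,\left|A - B D^{-1} C\right|$$

we have

$$|\Sigma(X \oplus Y)| = |\Sigma(Y)| \cdot |\Sigma(X \mid Y)|$$

from which we obtain (11) [33].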
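
Both steps of this argument are easy to verify numerically. The sketch below, which assumes NumPy and uses an arbitrary randomly generated positive-definite covariance matrix, checks the block determinant factorization and then compares the conditional entropy obtained from the chain rule with the expression (11). The helper name `gaussian_entropy` is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)
n, m = 2, 3

# Random positive-definite joint covariance of X (+) Y (illustrative only).
M = rng.standard_normal((n + m, n + m))
S = M @ M.T + (n + m) * np.eye(n + m)
Sxx, Sxy, Syy = S[:n, :n], S[:n, n:], S[n:, n:]

def gaussian_entropy(C):
    """H = (1/2) ln|C| + (1/2) k ln(2 pi e) for a k-dimensional Gaussian."""
    k = C.shape[0]
    _, logdet = np.linalg.slogdet(C)
    return 0.5 * logdet + 0.5 * k * np.log(2 * np.pi * np.e)

partial = Sxx - Sxy @ np.linalg.solve(Syy, Sxy.T)   # Sigma(X|Y), eq. (1)

# Block determinant identity: |Sigma(X (+) Y)| = |Sigma(Y)| |Sigma(X|Y)|
lhs = np.linalg.det(S)
rhs = np.linalg.det(Syy) * np.linalg.det(partial)

# Eq. (11) against the chain rule H(X|Y) = H(X (+) Y) - H(Y)
h_chain = gaussian_entropy(S) - gaussian_entropy(Syy)
h_partial = gaussian_entropy(partial)
print(np.isclose(lhs, rhs), np.isclose(h_chain, h_partial))
```

Note that `gaussian_entropy(partial)` uses k = n, which is exactly the dimension term that survives when the (n + m) and m contributions cancel in the chain rule.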

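If, then, the processes $X_t$, $Y_t$, $Z_t$ are jointly multivariate Gaussian (i.e. any finite subset
of the component variables of the $X_t$, $Y_s$, $Z_u$ has a joint Gaussian distribution) it follows from
(11) that the expression (10) for transfer entropy becomes [34]

$$\mathcal{T}_{Y \to X \mid Z} = \tfrac{1}{2}\ln\!\left(\frac{|\Sigma(X \mid X^{-} \oplus Z^{-})|}{|\Sigma(X \mid X^{-} \oplus Y^{-} \oplus Z^{-})|}\right) \tag{12}$$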

Comparing (12) with (9) leads directly to our central result: if all processes are jointly
Gaussian, then

$$\mathcal{F}_{Y \to X \mid Z} = 2\,\mathcal{T}_{Y \to X \mid Z} \tag{13}$$



so that Granger causality and transfer entropy are equivalent up to a factor of 2. This 
result holds, in particular, for a univariate predictee X with the standard definition (8) of 
G-causality. 
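
The factor of 2 can be seen in sample with a short sketch in which G-causality is computed from OLS residuals and transfer entropy from plug-in Gaussian entropies of the sample covariance matrices. NumPy is assumed. For brevity the "past" variables here are simply correlated Gaussian columns standing in for the lagged vectors; they are not lags of an actual time series, and all names and coefficients are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(3)
N = 2000

# Stand-ins for X^-, Y^-, Z^- and a 2-dimensional predictee X (illustrative only).
Xp = rng.standard_normal((N, 2))
Yp = rng.standard_normal((N, 1)) + 0.5 * Xp[:, :1]
Zp = rng.standard_normal((N, 1))
X = 0.5 * Xp + Yp @ np.array([[0.8, -0.4]]) + 0.3 * Zp + rng.standard_normal((N, 2))

def ols_resid_cov(target, regs):
    """Residual covariance of an OLS regression with intercept."""
    D = np.hstack([np.ones((len(target), 1)), regs])
    coef, *_ = np.linalg.lstsq(D, target, rcond=None)
    return np.atleast_2d(np.cov(target - D @ coef, rowvar=False))

def gauss_H(data):
    """Plug-in Gaussian entropy from the sample covariance of the columns."""
    C = np.atleast_2d(np.cov(data, rowvar=False))
    return 0.5 * np.linalg.slogdet(C)[1] + 0.5 * C.shape[0] * np.log(2 * np.pi * np.e)

def cond_H(a, b):
    """H(a | b) = H(a (+) b) - H(b)."""
    return gauss_H(np.hstack([a, b])) - gauss_H(b)

R = np.hstack([Xp, Zp])            # restricted regressors X^- (+) Z^-
U = np.hstack([Xp, Yp, Zp])        # unrestricted regressors X^- (+) Y^- (+) Z^-

# G-causality as a log-ratio of generalized residual variances, as in (9).
F = np.linalg.slogdet(ols_resid_cov(X, R))[1] - np.linalg.slogdet(ols_resid_cov(X, U))[1]
# Transfer entropy as a difference of conditional entropies, as in (10).
T = cond_H(X, R) - cond_H(X, U)
print(F, 2 * T)
```

When both quantities are built from the same sample covariance matrices, as here, they agree up to floating-point error; this is the covariance-based estimation route discussed in the following paragraph.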

Empirically, numerical equivalence between G-causality and transfer entropy will depend 
on the method used to estimate the transfer entropy in sample. If it is assumed at the outset 
that the data may be reasonably modeled as Gaussian — and that, consequently, conditional 
entropies may be estimated from the appropriate sample covariance matrices — then, 
of course, numerical equivalence will be guaranteed. If, however, conditional entropies 
are estimated directly from sampled probability distributions, results will vary with the 
estimation technique. It is known that naive estimation of transfer entropy by partitioning 
of the state space is problematic [7] and that such estimators frequently fail to converge 
to the correct result [18]. In practice, more sophisticated techniques such as kernel [21] or
k-nearest neighbour estimators [22, 23] will need to be deployed; however, such techniques
may entail their own assumptions about the empirical distribution of the data (see [18] 
for a good discussion on these points). Furthermore, unlike G-causality, for which the 
(asymptotic) distribution of the sample statistic is known, we are not aware of any such 
general result for transfer entropy. Thus, in particular, significance testing for transfer
entropy estimates is likely to be hard.

Our result (13) provides for the first time a unified framework for data-driven causal 
inference that bridges information-theoretic and autoregressive methods. In particular, it 
opens new research possibilities in transforming findings originally developed in one domain 
into the other. For example, an advantage of the autoregressive approach is that it admits a 
straightforward decomposition by frequency [6, 14]. Our result now provides a foundation for 
the development of spectral implementations of transfer entropy. In the opposite direction, 
the invariance of information-theoretic quantities under general nonlinear transformations 
[18] could potentially prove useful in the identification of appropriate nonlinear autoregressive
models [24, 25]. Preliminary work by the authors indicates, perhaps surprisingly, that
under Gaussian assumptions there is nothing extra to account for by nonlinear extensions to 
G-causality, since a stationary Gaussian AR process is necessarily linear [17]. This finding 
has practical significance because sensitivity to nonlinear data features is often presented as 
a reason to prefer transfer entropy to G-causality (see e.g. [26]).

As regards Gaussian assumptions, although their appropriateness may be disputed in the 
context of specific physical systems, they are nevertheless widely employed in neuroscience, 
econometrics and beyond, frequently in the role of an analytical benchmark for subsequent 
more physically motivated analysis. In practice, given empirical data it is likely to be 
difficult to establish the extent to which Gaussian assumptions are tenable, particularly for 
highly multivariate datasets and limited sample sizes. Further research is thus required to 
characterize — both analytically and in sample — the manner in which the equivalence (13) 
breaks down when Gaussian assumptions fail. As a starting point it is known, at least, that 
in the generic (non-Gaussian) case, nonzero G-causality implies nonzero transfer entropy 
[27]. 

More generally, G-causality is typically implemented within the well-understood and 
easily applicable framework of MVAR modeling. This implementation, however, implies 
many assumptions about how to model the data. Transfer entropy, by contrast, although
on a theoretical level "model agnostic" (in the sense that it involves no presumptions about 
the joint statistical distribution of the data), may present severe difficulties in empirical 
application. Investigators, then, are free to use whichever practical methods best suit their 
data. Numerical issues aside, the analytical equivalence (13) furnishes the essential point 
that — under Gaussian assumptions — G-causality has a natural interpretation as transfer 
entropy and vice-versa. 